Gaussian Mixture Model for numerical data
CHAT: When did Carl Friedrich Gauß define the Normal/Gaussian distribution? (answer at the end of this section)
Here we assume that each point, given its true cluster assignment, follows a Gaussian distribution. This stochastic framework allows us to determine the clusters via Maximum Likelihood and to evaluate the fit with statistical measures such as the BIC (Bayesian Information Criterion), which is defined as nr_parameters * log(nr_observations) - 2 * log-likelihood; lower values indicate a better trade-off between fit and model complexity.
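As a quick illustration of the formula above, here is a minimal helper in base R (the function name bic and its arguments are our own, not part of any package):

```r
# BIC = k * log(n) - 2 * logLik; lower is better
bic <- function(logLik, k, n) {
  k * log(n) - 2 * logLik
}

# Example: 10 free parameters, 500 observations, log-likelihood of -1234.5
bic(logLik = -1234.5, k = 10, n = 500)
```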
We fit configurations with 1 to 10 clusters, hence max_clusters = 10. Ensure that the number of points is much larger than the number of clusters. Most of the other parameters control the algorithm and we do not investigate them in detail. In practice, a quick sensitivity analysis shows whether deviating from the defaults makes a difference. Ideally, one at least observes convergence for large values of km_iter and em_iter; if not, the algorithm may be unstable and not fit for the data at hand.
library(ClusterR)
opt_gmm = Optimal_Clusters_GMM(dat_num_no_missings_scaled, max_clusters = 10,
                               criterion = "BIC", dist_mode = "eucl_dist",
                               seed_mode = "random_subset", seed = seed_var,
                               km_iter = 10, em_iter = 10, var_floor = 1e-10,
                               plot_data = TRUE)

The fit improves as we increase the number of clusters from one, but eventually the improvement becomes smaller than the penalization in the BIC formula. We decide to use 7 clusters: this value is quite close to the minimum, and we want a rather low number of clusters for our applications.
gmm_res = GMM(dat_num_no_missings_scaled, gaussian_comps = 7, dist_mode = "eucl_dist",
              seed_mode = "random_subset", km_iter = 10, em_iter = 10, seed = seed_var)
We define a wrapper to easily obtain predictions from our Gaussian Mixture Model (GMM).
predictGMM <- function(x, dat) {
  centroids = x$centroids
  cov = x$covariance_matrices
  w = x$weights
  return(predict_GMM(dat, centroids, cov, w))
}
The resulting clusters are more evenly distributed
pred_gmm_res <- predictGMM(gmm_res,dat_num_no_missings_scaled)
summary(factor(pred_gmm_res$cluster_labels))
0 1 2 3 4 5 6
65 22 81 6 316 235 196
Since we are dealing with a probabilistic model, we can compute the probability that each point x belongs to cluster k; one therefore often speaks of “soft clustering”. As long as no decision is taken, each point belongs to each cluster with a certain probability. Naturally, one assigns each point to the most likely cluster. We observe that the probabilities of the most likely cluster are nicely concentrated near 1, indicating a clear cluster assignment for most points.
probab_of_majority_class <- apply(pred_gmm_res$cluster_proba,1,max)
hist(probab_of_majority_class)
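To quantify how clear the assignments are, one can compute the share of points whose most likely cluster has a high probability. Here a small toy probability matrix stands in for pred_gmm_res$cluster_proba, and the 0.9 threshold is our own choice:

```r
# Toy soft-clustering probability matrix (each row sums to 1)
proba <- rbind(c(0.95, 0.03, 0.02),
               c(0.50, 0.30, 0.20),
               c(0.01, 0.98, 0.01),
               c(0.10, 0.10, 0.80))

probab_of_majority <- apply(proba, 1, max)

# Share of points with a clear assignment (max probability above 0.9)
mean(probab_of_majority > 0.9)
# -> 0.5
```

On the real data one would replace proba by pred_gmm_res$cluster_proba.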

YES OR NO: Do the 7 cluster distributions have independent components?
gmm_res$covariance_matrices
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.7045233 0.0000000001 0.9119202 1.5981679 1.41460231 1.605640946 0.56296429 0.043127237
[2,] 2.3970460 1.1809796578 1.5502328 0.8463268 1.41567610 3.897214112 10.48325213 1.094608677
[3,] 0.1494162 0.0000000001 0.8652210 1.0080487 0.22676706 0.013508312 0.11334347 1.697885553
[4,] 0.3497369 3.2622415977 0.6673540 0.4346947 0.95265254 0.640867547 0.41243678 58.533182013
[5,] 0.1441920 0.0000000001 1.2042783 1.2350323 0.08981686 0.003363012 0.01527854 0.001554085
[6,] 0.2207778 0.0000000001 0.7247845 0.6685620 0.03142149 0.013553045 0.06685488 0.003018756
[7,] 0.6988894 0.0000000001 0.9686870 0.4843733 0.56035609 0.904982843 0.66450006 0.014623842
What appears to be a non-diagonal covariance matrix is in fact something else: each cluster distribution lives in a dim(dat_num_no_missings_scaled)[2]-dimensional space, since one dimension is needed per variable, and each row above contains the diagonal of one cluster’s covariance matrix. So the answer is yes, they do have independent components.
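To make this concrete, any such row can be turned back into the familiar diagonal matrix form with diag(); here a shortened toy row of variances stands in for a row of gmm_res$covariance_matrices:

```r
# One cluster's per-dimension variances (toy values, 3 dimensions)
row_var <- c(0.70, 0.91, 1.60)

# The full covariance matrix of that cluster is diagonal,
# i.e. all off-diagonal covariances are zero
Sigma <- diag(row_var)
Sigma
```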
Note that the 2nd component of many clusters has practically zero variance (the value was clipped to var_floor = 1e-10), so there is no variation. The 2nd component refers to the 2nd feature, the number of engines.
Gauß defined the Normal distribution in 1809 in his work “Theoria motus corporum coelestium in sectionibus conicis solem ambientium”, which you can surprisingly even buy on Amazon.
Mixed data types
For data sets with numerical and categorical features, we assume a joint model whose components are Gaussian for the numerical features and multinomial (categorical) for the categorical features.
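To make the categorical part concrete, here is a toy base-R sketch (our own illustration, not MixAll code): each cluster k has a probability vector over the levels of a categorical feature, the likelihood of an observed level under cluster k is the corresponding entry, and Bayes’ rule yields the soft assignment:

```r
# Toy example: one categorical feature with 3 levels, two clusters
p1 <- c(0.7, 0.2, 0.1)  # level probabilities under cluster 1
p2 <- c(0.1, 0.3, 0.6)  # level probabilities under cluster 2
w  <- c(0.5, 0.5)       # cluster weights

x   <- 3                # observed level
lik <- c(p1[x], p2[x])  # likelihood of the observation under each cluster

# Posterior cluster probabilities via Bayes' rule (soft assignment)
post <- w * lik / sum(w * lik)
post
# -> approximately c(0.143, 0.857): cluster 2 is much more likely
```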
library(MixAll)
In contrast to the previous setting with only numerical variables, the optimal number of clusters is 5.
ldata = list(dat_no_missings_scaled[, mget(cat_vars)],
             dat_no_missings_scaled[, mget(num_vars)])
lnames = c("categorical_pk_pjk", "gaussian_pk_sjk")
set.seed(seed_var)
clustermix_res <- clusterMixedData(ldata, lnames,
                                   nbCluster = 5:9,
                                   strategy = clusterFastStrategy(),
                                   criterion = "BIC")
summary(clustermix_res)
**************************************************************
* model name = categorical_pk_pjk
* model name = gaussian_pk_sjk
* nbSample = 789
* nbCluster = 5
* lnLikelihood = 18043.41
* nbFreeParameter= 349
* criterion name = BIC
* criterion value= -33758.72
**************************************************************
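As a sanity check, the reported criterion value can be reproduced from the other numbers in the summary using the BIC formula from the numerical section, nr_parameters * log(nr_observations) - 2 * log-likelihood:

```r
# Values taken from the summary output above
lnLik <- 18043.41  # lnLikelihood
k     <- 349       # nbFreeParameter
n     <- 789       # nbSample

bic_val <- k * log(n) - 2 * lnLik
round(bic_val, 2)
# -> -33758.72, matching the reported criterion value
```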
There is no need to re-fit the model with this package:
pred_clustermix <- clustermix_res@zi
# pred_clustermix <- clusterPredict(ldata,clustermix_res) # for new data one would need the predict function
table(pred_clustermix)
pred_clustermix
1 2 3 4 5
3 21 37 242 486
3 minute BREAK. Have a bio break if you need one, get a tea / coffee or simply a glass of water.